MOVING PICTURE ENCODER



An encoder comprising: a television camera (12) which images an object and generates corresponding
image signals; a plurality of microphones (11L and 11R), spaced apart from each other, which collect
the voice of the object imaged by the television camera and output voice signals; an estimating
circuit (13) which estimates the position of the sound source based on the voice signals obtained
from the microphones; and an encoding circuit which encodes the image signals in an image area of a
given extent centered at the sound source position estimated by the estimating circuit, assigning
them a somewhat greater amount of code than is assigned to the image signals in the other image
areas, so that the image area of the given extent has a higher resolution than the other areas.

Technical Field

This invention relates to a coding apparatus for encoding video signals, and more particularly to a
moving picture coding apparatus which specifies a significant portion in a picture on the basis of
the audio signal sent together with the video signal, increases the coded bit rate allocated to the
specified picture area, and thereby encodes the picture.

Background Art 

With the recent advance in communication technology, remote conference systems (television
conference systems) and videophone systems available even for the individual have been put into
practical use.

In such systems, images and sound are transmitted using communication channels such as telephone
circuits, which limits the coded bit rate transmittable per channel. To keep the amount of picture
signal data below the upper limit of the coded bit rate, the picture information is encoded before
transmission.

Since the coded bit rate transmittable per unit time is limited, the coded bit rate available per
frame, while still ensuring natural movement, is determined by the transmission rate when moving
pictures are transmitted.

Generally, coding is effected so that the entire screen may be uniform in resolution. This, however,
causes the problem of blurring the picture of the other party's face. Normally, a person does not
pay attention to the whole screen, but tends to concentrate on a significant portion of it.
Therefore, if the picture quality of the significant portion is improved, there is almost no problem
in understanding the picture even if the remaining portions have a somewhat low resolution.

Viewed in this light, coding methods have been studied which display the face area of a person, a
more important source of information, more sharply than the remaining areas in order to improve the
subjective picture quality. One such technique that has been proposed uses interframe differential
pictures (literature: Kamino et al., "A study of a method of sensing the face area in a color
moving-picture TV telephone," the 1989 Electronic Information Communication Society's Spring
National Meeting D-92).

With this system, the person talking over the telephone is picked up with a television camera. From
the picture signal thus obtained, moving portions in the picture are picked up. The face area of the
speaker is estimated on the basis of the picked-up area. A large coded bit rate is allocated to the
estimated face area and a small coded bit rate is given to the remaining areas. By performing such a
coding process, the person's face area is displayed more sharply than the remaining areas.

In cases where such a face-area-pickup method in a moving-picture TV telephone is applied to a
conference system, when moving objects other than the person are picked up unintentionally, or when
more than one person is picked up with each showing changes of expression, it is difficult to
estimate the face area of the speaker.

As described above, when more than one person is picked up or when moving objects other than a
person are picked up, there arises the problem of being unable to extract only the face area of the
speaker, the most important factor in a method of picking up the face area in a moving picture.

Accordingly, the object of the present invention is to provide a moving-picture coding apparatus capable 
of estimating the position of the speaker in the video signal precisely, extracting the area of the speaker in 
the screen accurately, and thereby sharply displaying the area in which the speaker appears. 

Disclosure of Invention

According to the present invention, it is possible to provide a moving-picture coding apparatus in
an image transmission system for encoding and transmitting video signals, the apparatus comprising:
a television camera for picking up a subject and generating a video signal; microphones separated
from each other for collecting the vocal sound from the subject picked up by the television camera
and outputting audio signals; a sound source position estimating circuit for estimating the position
of a sound source on the basis of the audio signals from the microphones; and a coding circuit for
encoding, at a somewhat greater coded bit rate than that for the remaining picture areas, the video
signal corresponding to the picture area within a specific range centered at the sound source
position estimated at the sound source position estimating circuit, so that the picture area within
the specific range may have a higher resolution.

With a moving-picture coding apparatus thus constructed, the television camera picks up a subject
and outputs a video signal. The microphones arranged separately from each other in front of the
subject collect the vocal sound. The sound source position estimating circuit estimates the position
of the sound source on the basis of the audio signals collected from a plurality of channels. The
coding circuit encodes the video signal from the television camera in such a manner that the video
signal for a specific range centered at the sound source position estimated at the estimating
circuit is encoded at a somewhat greater coded bit rate than that for the remaining picture areas,
so that the picture area within the specific range may have a higher resolution.

As a result, it is possible to encode mainly the vicinity of the sound source position on the screen
at higher resolution, so that video signals can be encoded in a way that displays the speaker more
sharply. In particular, by matching the picture area within the specific range centered at the
estimated sound source position to the range of the subject's face area in the screen, the video
signal can be encoded so that the face area of the speaker may have a higher resolution.

Brief Description of Drawings 

FIG. 1 is a block diagram of a picture coding section in a television conference system according to
an embodiment of the present invention;
FIG. 2 is a drawing to help explain an embodiment of the present invention, which shows an
arrangement of a conference room for a television conference system associated with the present
invention;
FIG. 3 is a block diagram of the sound source position estimating section of FIG. 1;
FIG. 4A and FIG. 4B are circuit diagrams of the sound source position estimating circuit of FIG. 3;
FIG. 5 is a drawing to help explain how the sound source position estimating circuit of FIG. 3 makes
estimation;
FIG. 6 is a drawing to help explain how the picture coding section of FIG. 1 determines the
important coding area; and
FIG. 7 is a block diagram of the picture coding section of FIG. 1.

Best Mode of Carrying Out the Invention 

Hereinafter, referring to the accompanying drawings, an embodiment of the present invention will be
explained. This invention provides a picture coding apparatus employing a moving-picture coding
method which estimates the sound source position on the basis of the audio signals from a plurality
of channels, encodes mainly the estimated vicinity of the sound source position, and thereby effects
coding so that the speaker may be displayed more sharply.

FIG. 2 shows a schematic layout of a conference room for a television conference system containing a
picture coding apparatus of the invention. In the figure, a single camera covers three persons at
the conference.

As shown in FIG. 2, on a table 9 at which attendants A1 to A3 sit, two microphones (sound-sensitive
means) 11R and 11L are placed laterally at equal intervals so as to cover the speech of the
attendants. In front of the table 9, there is provided a television camera 12, which covers the
images of the attendants A1 to A3 sitting at the table 9 side by side.

The audio signals from the right and left microphones 11R and 11L and the video signal from the
television camera 12 are supplied to a picture estimation coding section 10, which encodes these
signals so that they may fall within a specified coded bit rate per screen. The audio signals are
also supplied to an audio signal processing system (not shown), which converts them into digital
signals, which are then sent together with the encoded video signal to a transmission line. Thus,
these signals are transmitted to the other party.

The picture estimation coding section 10, acting as a picture processing system, estimates the
position of the speaker's face area on the basis of attendants A1 to A3 covered by the television
camera 12, encodes the video signal for the estimated position area with a somewhat greater coded
bit rate M(i) than the video signals for the other areas, and encodes the other areas with the
remaining coded bit rate M(0). Specifically, the total coded bit rate M(total) per screen is
determined. The determined coded bit rate is divided into a coded bit rate M(i) allocated to the
estimated position area and a coded bit rate M(0) allocated to the other areas. This gives:

$$M(\text{total}) = M(i) + M(0)$$

The picture estimation coding section 10 comprises a sound source position estimating section 13, a
sound source position information storage section 14, a picture coding section 15, and an image
memory 16. The image memory 16 temporarily holds, screen by screen, the picture data obtained by
converting the video signal from the television camera 12 into digital form. The image memory has
enough capacity to store a plurality of pictures for image processing and updates the picture data
constantly. The sound source position estimating section 13 estimates the position of the sound
source. Specifically, the estimating section 13 estimates the position of the speaker on the basis
of the audio signal outputs from the microphones 11R and 11L, and simultaneously estimates the sound
source position on the picture, or the area of the speaker, on the basis of the positions of the
left and right microphones 11L and 11R in the picture data stored in the image memory 16. The sound
source position information storage section 14 stores information on the sound source position
estimated at the sound source position estimating section 13 and information on the time at which
the estimation was performed. Here, the time information is externally supplied. Alternatively, the
picture estimation coding section 10 may be provided with a clock circuit, from which the time
information may be supplied.

The picture coding section 15 encodes the picture data stored in the image memory 16 on the basis of
the information from the sound source position information storage section 14, and outputs the
encoded data. Specifically, the coding section encodes the video signal so that an area centered at
the speaker's position may be displayed more clearly. To do this, the picture coding section 15
determines the area at the speaker's position on the picture to be the important coding area on the
basis of the information on the speaker's position stored in the sound source position information
storage section 14. Then, the coding section allocates the coded bit rate M(i) to the video signal
for the important coding area and the coded bit rate M(0) to the video signals for the other areas,
and encodes the video signals for the individual areas so that they may fall within the allocated
ranges.

The sound source position estimating section 13 comprises a delay circuit 31, an estimating circuit
32, a subtracter circuit 33, and a sound source position estimating circuit 34, as shown in FIG. 3.
The delay circuit 31 delays the left-channel audio input signal from the left microphone 11L. The
estimating circuit 32 estimates a left-channel audio signal on the basis of the delayed left-channel
audio input signal from the delay circuit 31 and the right-channel audio signal from the right
microphone 11R. The subtracter circuit 33 receives the delayed left-channel audio signal from the
delay circuit 31 and the estimated left-channel audio signal from the estimating circuit 32, and
subtracts the estimated left-channel audio signal from the left-channel audio signal to produce the
difference signal. When the difference signal is fed back to the estimating circuit 32, the
estimating circuit 32 estimates such a left-channel audio signal as allows the difference signal to
become zero and outputs the estimated audio signal. In this way the estimating circuit 32 obtains,
as an estimated impulse response series H(k), the response that yields the left-channel audio signal
from the right-channel audio signal of the right microphone 11R, with the delayed left-channel audio
input signal as the reference. Using the estimated impulse response series H(k) obtained at the
estimating circuit 32, the sound source position estimating circuit 34 estimates the position of the
sound source.

With the above configuration, the television camera 12 picks up the persons who are present at the
conference, and simultaneously vocal sounds are collected by the microphones 11R and 11L on the
table 9. The video signal from the television camera 12 is sent to the picture coding section 15,
and the audio signals from the microphones 11R and 11L are sent to the sound source position
estimating section 13. The sound source position estimating section 13 estimates the position of the
sound source on the basis of the audio signals. The estimation result is stored in the sound source
position information storage section 14.

Using the latest sound-source position information stored in the sound source position information
storage section 14, the picture coding section 15 specifies the area corresponding to the sound
source position in the video image on the screen, encodes the area with the preset coded bit rate
M(i) and the other areas with the coded bit rate M(0), and transmits the encoded signal. This
enables the speaker among the persons who are present at the conference to be displayed at a high
resolution on a monitor (not shown) on the reception side.

How the speaker is specified will be explained in more detail. 

In FIG. 3, let the vocal sound uttered by speaker A1 be $X(\omega)$. The vocal sound $X(\omega)$ is
collected by the microphones 11R and 11L. If the input audio signal to the right microphone 11R is
$Y_R(\omega)$ and the input audio signal to the left microphone 11L is $Y_{L0}(\omega)$, these input
audio signals are expressed as follows, using transfer functions $F_R(\omega)$ and $G_L(\omega)$
determined by the sound propagation delay between the sound source and the microphones and the
audio characteristics in the room:

$$Y_R(\omega) = F_R(\omega) X(\omega) \tag{1}$$
$$Y_{L0}(\omega) = G_L(\omega) X(\omega) \tag{2}$$

Furthermore, the left-channel input audio signal $Y_{L0}(\omega)$ undergoes a flat delay of
$C(\omega)$ at the delay circuit 31, which assures the law of causality at the estimating circuit
32. The delayed left-channel input audio signal can therefore be expressed by $Y_L(\omega)$ as
follows, using a transfer function $F_L(\omega)$ that includes the delay circuit 31:

$$Y_L(\omega) = C(\omega) G_L(\omega) X(\omega) = F_L(\omega) X(\omega) \tag{3}$$

This left-channel input audio signal $Y_L(\omega)$ is inputted to the subtracter circuit 33. Using
the right-channel audio signal $Y_R(\omega)$ and the left-channel audio signal $Y_L(\omega)$, the
estimating circuit 32 estimates the transfer function $G(\omega)$ of the following equation (4),
which yields the left-channel audio signal $Y_L(\omega)$ from the right-channel audio signal
$Y_R(\omega)$, and then generates an estimated transfer function $G_p(\omega)$ from the transfer
function $G(\omega)$:

$$G(\omega) = G_L(\omega) / F_R(\omega) \tag{4}$$

Specifically, the estimated transfer function $G_p(\omega)$ for the transfer function $G(\omega)$ is
generated as follows.

Using the right-channel audio signal $Y_R(\omega)$, the estimating circuit 32 calculates an
estimated left-channel audio signal in the time domain. The estimating circuit 32 includes an
adaptive transversal filter 32a for computing an estimated left-channel audio signal $y_p(k)$ in the
time domain, as shown in FIG. 4A, and a correction circuit 32b for constantly updating an estimated
impulse response series $H_p(k)$ for the transfer function $G(\omega)$, as shown in FIG. 4B. The
adaptive transversal filter 32a and the correction circuit 32b operate in synchronization with a
system clock supplied from a clock generator (not shown). The adaptive transversal filter 32a
comprises: n-tap shift registers $41_1$ to $41_{n-1}$ for passing the input audio signal
$Y_R(\omega)$ along continuously and converting it into the right-channel audio signals x(k) to
x(k-n+1) for the individual time components; multipliers $42_1$ to $42_n$ for multiplying, component
by component, the estimated impulse responses $h_{p1}(k)$ to $h_{pn}(k)$ for the individual time
components, as corrected at the correction circuit 32b, by the right-channel audio signals x(k) to
x(k-n+1) obtained by way of the shift registers $41_1$ to $41_{n-1}$; and an adder 43 for finding
the sum ($\Sigma$) of the multiplication results and obtaining an estimated left-channel audio input
signal $y_p(k)$.

Specifically, the correction circuit 32b performs an operation using equation (10) (explained later)
to obtain the estimated impulse response series $h_{p1}(k)$ to $h_{pn}(k)$, divides them by time
component, and gives them to the corresponding multipliers $42_1$ to $42_n$ in the adaptive
transversal filter 32a. The multipliers $42_1$ to $42_n$ multiply, component by component, the
estimated impulse response series $h_{p1}(k)$ to $h_{pn}(k)$ by the right-channel audio signals x(k)
to x(k-n+1) obtained by way of the shift registers $41_1$ to $41_{n-1}$, and thereby obtain
estimated left-channel audio signals by time component. The adder 43 adds up these estimated
left-channel audio signals for the individual time components and obtains an estimated left-channel
audio signal $y_p(k)$.

In such an estimating circuit 32, the right-channel audio signal x(k) is inputted to n stages of
shift registers $41_1$ to $41_{n-1}$, which have a delay of one sample time per stage, and thereby a
time series vector expressed by equation (5) is produced:

$$X(k) = (x(k),\ x(k-1),\ \ldots,\ x(k-n+1))^T \tag{5}$$

where $(\ )^T$ indicates a transposed vector.

On the other hand, an estimated impulse response series $H_p(k)$, which approximates the estimated
transfer function $G_p(\omega)$ in the time domain, is expressed by equation (6):

$$H_p(k) = (h_{p1}(k),\ h_{p2}(k),\ \ldots,\ h_{pn}(k))^T \tag{6}$$

An estimated left-channel audio signal $y_p(k)$, or an estimated value of the left-channel audio
signal y(k), can be obtained using the following equation (7):

$$y_p(k) = H_p(k)^T \cdot X(k) \tag{7}$$

Here, let the impulse response series H for the transfer function $G(\omega)$ be expressed by
equation (8), where n is an integer:

$$H = (h_1,\ h_2,\ \ldots,\ h_n)^T \tag{8}$$

Then, when the estimated impulse response series $H_p(k)$ becomes:

$$H_p(k) = H \tag{9}$$

the estimated left-channel audio signal $y_p(k)$ approximates the actual left-channel audio signal
y(k) very closely, which means that the transfer function has been estimated satisfactorily.

Accordingly, it is only necessary to find an estimated transfer function $G_p(\omega)$ that matches
the transfer function $G(\omega)$ and so provides the relationship expressed by equation (9). In
other words, it is only necessary to estimate an estimated impulse response series $H_p(k)$ that
allows the estimated transfer function $G_p(\omega)$ to become the transfer function $G(\omega)$.

The estimation of the estimated impulse response series $H_p(k)$ at the estimating circuit 32 is
effected in such a manner that, for the adaptive transversal filter 32a, the correction circuit 32b
for example performs the following operation continuously, using the time series x(k) to x(k-n+1)
obtained as the inputs and outputs of the n stages of shift registers $41_1$ to $41_{n-1}$:

$$H_p(k+1) = H_p(k) + \alpha \cdot e(k) \cdot X(k) / \lVert X(k) \rVert^2 \tag{10}$$

where $H_p(0) = 0$.

This algorithm is a known learning identification method. In equation (10), e(k) is the output of
the subtracter circuit 33 of FIG. 3; with the estimated left-channel audio signal denoted $y_p(k)$,
the output e(k) has the relationship expressed by equation (11):

$$e(k) = y(k) - y_p(k) \tag{11}$$

This means that the output e(k) of the subtracter circuit 33 is the difference signal between the
left-channel audio signal y(k) and the estimated left-channel audio signal $y_p(k)$. In equation
(10), $\alpha$ is a coefficient determining the converging speed and the stability of equation (10),
and indicates the difference in distance between the left and right microphones 11L and 11R.

Thus, in the picture estimation coding section 10, the positions of the left and right microphones
11L and 11R are found from the picture data stored in the image memory 16, and then the difference
in distance $\alpha$ is determined. Using this distance difference and the output e(k) of the
subtracter circuit 33, the correction circuit 32b performs an operation according to equation (10)
and thereby estimates the estimated impulse response series $H_p(k)$.
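
The update of equation (10), together with the filter output of equation (7) and the error of equation (11), can be sketched in a few lines of Python. This is only an illustrative sketch, not the circuit of FIG. 4A and FIG. 4B: the function name `estimate_impulse_response` is hypothetical, and the sketch treats α simply as a fixed step size (0.5 by default), which is an assumption made for the example.

```python
def estimate_impulse_response(x, y, n_taps, alpha=0.5):
    """Learning identification method, equation (10), in plain Python.

    x: right-channel samples x(k); y: delayed left-channel samples y(k).
    alpha is treated here as a plain step size (assumption of this sketch).
    Returns the estimated impulse response series Hp after the last sample.
    """
    hp = [0.0] * n_taps    # Hp(0) = 0
    reg = [0.0] * n_taps   # X(k) = (x(k), x(k-1), ..., x(k-n+1))
    for xk, yk in zip(x, y):
        reg = [xk] + reg[:-1]                          # shift by one sample per stage
        yp = sum(h * r for h, r in zip(hp, reg))       # equation (7)
        e = yk - yp                                    # equation (11)
        power = sum(r * r for r in reg)                # ||X(k)||^2
        if power > 0.0:                                # no update while the input is silent
            gain = alpha * e / power
            hp = [h + gain * r for h, r in zip(hp, reg)]   # equation (10)
    return hp
```

When the left channel is a delayed, scaled copy of the right channel, the largest coefficient of the returned series sits at the tap that corresponds to the delay, which is the property the position estimate relies on.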

Based on the estimated impulse response series $H_p(k)$ obtained through the above processing, the
sound source position estimating circuit 34 estimates the position of the sound source. The
estimation is performed as follows.

It is assumed that the term whose coefficient is the largest of the coefficients of the estimated
impulse response series $H_p(k)$ is Mx. Here, if the sampling period is T (sec), the speed of sound
is v (m/sec), and the number of taps is n, the difference in distance $\alpha$ between the sound
source and each of the left and right microphones 11L and 11R can be estimated using the following
equation (12):

$$\alpha = vT\,(Mx - n/2) \tag{12}$$

Here, as shown in FIG. 5, the left and right microphones 11L and 11R are linked to each other with a
straight line 52, and a straight line 53 parallel to the line 52 is imagined. Then, it is assumed
that the sound source 51 is positioned on the line 53, at a specific distance away from the left and
right microphones 11L and 11R. Let "a" be the distance to the sound source 51 from the point where a
line 54, passing perpendicularly through the mid-point Po between the left and right microphones 11L
and 11R on the line 52, intersects the line 53; let "b" be the linear distance from the right
microphone 11R to the sound source 51; let "c" be the length of a perpendicular line between the
line 53 passing through the sound source 51 and the line 52 passing through the microphones 11L and
11R; and let 2d be the distance between the microphones 11L and 11R. Then the following simultaneous
equations hold:

$$(b + \alpha)^2 = (d + a)^2 + c^2, \qquad b^2 = (d - a)^2 + c^2 \tag{13}$$

By eliminating b from the simultaneous equations and solving for "a," the position of the sound
source Pa can be estimated.
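
As a worked illustration of equations (12) and (13): subtracting the second equation of (13) from the first gives $2b\alpha + \alpha^2 = 4ad$, and substituting the resulting b back into the second equation yields $a = (\alpha/2)\sqrt{(4d^2 + 4c^2 - \alpha^2)/(4d^2 - \alpha^2)}$, with a taking the sign of α. The Python sketch below evaluates these two steps. The function names, the default speed of sound of 340 m/sec, and the assumption that c is known in advance are illustrative choices, not values given in the text.

```python
import math

def path_difference(mx, n_taps, sample_period, sound_speed=340.0):
    """Equation (12): alpha = v * T * (Mx - n/2).

    sound_speed defaults to 340 m/sec, an assumed room-temperature value.
    """
    return sound_speed * sample_period * (mx - n_taps / 2)

def lateral_offset(alpha, d, c):
    """Solve equations (13) for a after eliminating b.

    alpha: path difference (left distance minus right distance),
    d: half the microphone spacing, c: perpendicular distance to line 53.
    A positive result places the source on the right-microphone side.
    """
    if abs(alpha) >= 2 * d:
        raise ValueError("path difference must be smaller than the spacing 2d")
    ratio = (4 * d * d + 4 * c * c - alpha * alpha) / (4 * d * d - alpha * alpha)
    return 0.5 * alpha * math.sqrt(ratio)
```

Whether Mx is counted from zero or from one is not stated in the text, so a real implementation would need to fix that convention before applying equation (12).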

When data on the sound source position Pa thus estimated is inputted to the picture coding section
15 via the sound source position information storage section 14, a picture area centered at the
sound source is determined to be the important coding area, and the picture data corresponding to
this area is encoded with a greater amount of code than the picture data for the other areas. The
encoding will be explained in detail.

The image memory 16 stores a frame of picture data, which is divided into, for example, 44 x 36
blocks, each block consisting of 8 pixels x 8 lines, as shown in FIG. 6. The picture data stored in
the image memory 16 is sent to the picture coding section 15 in blocks one after another. The
picture coding section 15 comprises an orthogonal transform (DCT) circuit 71 connected to a read-out
terminal of the image memory 16, a quantization circuit 72 connected to the output terminal of the
DCT circuit 71, a variable length coding circuit 73 connected to the output terminal of the
quantization circuit 72, and a quantization step size deciding circuit 74 connected to the control
terminal of the quantization circuit 72. The picture coding section 15 further comprises a marker
recognizing circuit 75 and an important coding area deciding circuit 76. The marker recognizing
circuit 75 recognizes two markers 61a and 61b, placed so as to correspond to the left and right
microphones 11L and 11R, on the basis of the picture data read from the image memory 16, and
determines the distance 2d' between the microphones 11L and 11R on the screen. The markers are
entered by the operator in the apparatus when the microphones are arranged in the conference room.

When information on the determined distance 2d' is inputted to the important-coding-area deciding
circuit 76, the circuit 76 obtains the distance a' from the mid-point of the distance 2d' to the
position of the speaker 62 on the basis of the distance (2d') information and the sound source
position information read from the sound source position information storage section 14, using the
following equation (14):

$$a' = a \cdot d' / d \tag{14}$$

Furthermore, the important-coding-area deciding circuit 76 determines an area 63 with a preset width
of 2w' centered at the speaker's position 62 to be the important coding area. When information on
the important coding area is inputted to the step size deciding circuit 74, the step size deciding
circuit 74 determines a step size for encoding the picture data about the important coding area at a
higher coded bit rate than the picture data about the other areas. When information on the
determined step size is inputted to the quantization circuit 72, the quantization circuit 72
quantizes the picture data read from the image memory 16 and subjected to orthogonal transform at
the DCT circuit 71 in the determined step size, or with the determined coded bit rate. In this case,
quantization is effected in the determined step size when the picture data corresponding to the
important coding area 63 is inputted to the quantization circuit 72, whereas the picture data about
the other areas is quantized in a rougher step size than the picture data about the area 63. The
quantized picture data is subjected to variable length coding at the variable length coding circuit
73, which outputs the coded picture data.
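
The chain from equation (14) to the choice of quantization step size can be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the pixel coordinate `center_x` of the mid-point between the markers, the convention that a positive a' moves toward larger x on the screen, and the step sizes 4 and 16 are all hypothetical, while the 44 block columns of 8 pixels each follow the example of FIG. 6.

```python
def important_area(a, d, d_screen, center_x, half_width):
    """Equation (14) followed by the area of width 2w' around the speaker.

    a, d: real-world offset and half microphone spacing;
    d_screen: half the marker spacing on the screen (d');
    center_x: screen x of the mid-point between the markers (assumed known);
    half_width: w'. Returns (left, right) bounds in pixels.
    """
    a_screen = a * d_screen / d          # a' = a * d' / d
    speaker_x = center_x + a_screen      # assumed sign convention
    return speaker_x - half_width, speaker_x + half_width

def step_size_map(bounds, blocks_per_row=44, block_width=8,
                  fine_step=4, coarse_step=16):
    """One quantization step size per block column.

    Columns overlapping the important coding area get the finer step;
    the step values themselves are placeholders for illustration.
    """
    left, right = bounds
    steps = []
    for col in range(blocks_per_row):
        block_left = col * block_width
        block_right = block_left + block_width
        overlaps = block_left < right and block_right > left
        steps.append(fine_step if overlaps else coarse_step)
    return steps
```

A finer step size spends more code on the block, so the columns marked with `fine_step` are the ones that end up at the higher resolution described above.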

When the picture data thus encoded is sent to the reception side and is displayed on a reception
monitor, the image of the speaker is displayed at higher resolution than the other images.

While in the above embodiment, it has been explained that only information on the sound source is 
stored in the sound source position information storage section 14, time information may be stored as 
follows. 

Specifically, the sound source position estimating section 13 causes the sound source position
estimating circuit 34 to estimate the sound source position Pa on the basis of the term whose
coefficient is the largest of the coefficients of the estimated impulse response series $H_p(k)$.
The information on the sound source position Pa estimated at the sound source position estimating
section 13 and the time at which the estimation was effected are stored in the sound source position
information storage section 14 under the control of a control unit (not shown). At this time, when a
sound source position Pa(t) stored a time t ago lies within a specific width w to the right or to
the left of the latest sound source position Pa, the control unit controls the sound source position
information storage section 14 so that the stored information about the past sound source position
Pa(t) may be erased from the storage section 14. This allows the storage section 14 to store the
position of the current speaker and the last position of each of the persons (N persons) who spoke
in the past as follows:

$$\begin{array}{ll} T(1), & L(1) \\ T(2), & L(2) \\ \ \vdots & \ \vdots \\ T(N), & L(N) \end{array} \qquad \text{provided that } T(1) < T(2) < \cdots < T(N) \tag{15}$$

where T(i) is the time elapsed since speaker i uttered a vocal sound last, L(i) is the data
indicating the position where speaker i uttered a vocal sound last, T(1) is the time at which the
above operation is performed by the sampling of the vocal sound of the current speaker, and L(1) is
the data indicating the position where the current speaker uttered a vocal sound.
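
The bookkeeping performed by the control unit can be sketched as a small list update. The sketch below is illustrative only: the function names are hypothetical, positions are treated as one-dimensional coordinates, and timestamps are stored so that the elapsed times T(i) can be derived on demand, which is an implementation choice rather than something the text prescribes.

```python
def update_history(history, position, now, w):
    """Store the latest speaker position and erase nearby older entries.

    history: list of (timestamp, position), newest first.
    Entries within the width w of the new position are treated as the
    same speaker and erased, as described for the storage section 14.
    """
    kept = [(t, p) for (t, p) in history if abs(p - position) > w]
    return [(now, position)] + kept

def elapsed_table(history, now):
    """Return [(T(i), L(i))]: time elapsed since each speaker last spoke."""
    return [(now - t, p) for (t, p) in history]
```

Because the newest entry is always placed first and older entries keep their order, the resulting table satisfies T(1) < T(2) < ... < T(N) whenever the timestamps passed in are increasing.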

The picture coding section 15 encodes a picture as described above, on the basis of the information on 
the position L(1) of the latest speaker stored in the sound source position information storage section 14. 

It is assumed that the coded bit rate for the entire screen is M, the width of the entire screen is
$W_L$, the importance of the important coding area for speaker i is R(i), and the importance of the
areas other than the important coding area is R(0). The importances R(i) and R(0) can be set freely.
If greater importance is given to a person who spoke more recently, the setting can be effected as
follows:

$$R(1) > R(2) > \cdots > R(N) > R(0) \tag{16}$$

At this time, importance is allocated so that the coded bit rate M(i) for the important coding area
for the latest speaker (the picture area for the latest speaker), and the coded bit rate M(0) for
the areas other than the important coding area may be expressed as:

$$M(i) = M \cdot w' \cdot R(i) / R_T$$
$$M(0) = M \cdot (W_L - N \cdot w') \cdot R(0) / R_T$$

where $R_T$ is expressed as:

$$R_T = w'\,(R(1) + R(2) + \cdots + R(N)) + (W_L - N \cdot w')\,R(0) \tag{17}$$
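
The allocation of equation (17) can be checked numerically with a short Python sketch. The function name `allocate_bits` and the numbers used in any call are hypothetical; the sketch also assumes that the N important coding areas do not overlap, so that the remaining width is $W_L - N \cdot w'$ as in the formula.

```python
def allocate_bits(total, screen_width, area_width, importances,
                  background_importance):
    """Split the per-screen coded bit rate M according to equation (17).

    importances: [R(1), ..., R(N)]; background_importance: R(0);
    area_width: w'; screen_width: W_L.
    Returns ([M(1), ..., M(N)], M(0)).
    """
    n = len(importances)
    background_width = screen_width - n * area_width
    rt = area_width * sum(importances) + background_width * background_importance
    per_area = [total * area_width * r / rt for r in importances]
    background = total * background_width * background_importance / rt
    return per_area, background
```

By construction the returned values add up to M, so raising the importance of one speaker's area lowers the share of every other area rather than increasing the total coded bit rate.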

Therefore, by allocating a somewhat larger coded bit rate M(i) to the important coding area for
speaker i and the remaining coded bit rate M(0) to the other areas, and carrying out an encoding
operation within the allocated ranges, encoding can be effected so that an area centered at the
position of the speaker may be displayed more clearly. Consequently, although the total coded bit
rate per screen does not differ from that in a conventional equivalent, the subjective picture
quality of the entire screen can be improved.

As described above, the position of the sound source is estimated on the basis of the channel audio
signals collected by microphones arranged in different positions and the microphone positions on the
image screen including the microphones and the speaker. This enables the picture area of the speaker
on the image screen to be extracted accurately. In addition to this, allocating a larger coded bit
rate to the picture area of the speaker enables the moving-picture coding system to display the
picture area of the speaker clearly.

The present invention is not limited to the above embodiment, but may be practiced or embodied in still 
other ways without departing from the spirit or essential character thereof. 

For instance, while in the above embodiment, the adaptive transversal filter for time areas is used in the
estimating circuit 32 of the sound source position estimating section 13, another circuit configuration such
as an adaptive transversal filter for frequency areas may be used instead. Although the estimating algorithm
has been explained using a learning identification method as an example, another learning algorithm such
as a steepest descent method may be used.

While in the sound source estimating circuit 34, the position of the sound source is estimated on the
basis of the term whose coefficient is the largest of the coefficients of the estimated impulse response
series Hp(k), another method may be used.
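
The estimation described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the learning identification method is taken to be the normalized LMS update, the largest coefficient is taken by absolute value, and the function names, step size and synthetic signals are hypothetical. The mapping from the peak tap to a position on the screen is not shown.

```python
import random


def estimate_impulse_response(right, left_delayed, n_taps, step=0.5, eps=1e-8):
    """Adaptive transversal filter updated by a normalized LMS rule.

    right        -- right-channel samples fed through the n-tap shift register
    left_delayed -- delayed left-channel samples used as the target signal
    Returns the estimated impulse response series (one value per tap).
    """
    h = [0.0] * n_taps  # estimated impulse response series
    x = [0.0] * n_taps  # shift register contents, newest sample first
    for r, d in zip(right, left_delayed):
        x = [r] + x[:-1]                           # shift in the new sample
        y = sum(hj * xj for hj, xj in zip(h, x))   # estimated left-channel sample
        e = d - y                                  # difference signal fed back
        power = sum(xj * xj for xj in x) + eps     # normalization term
        h = [hj + step * e * xj / power for hj, xj in zip(h, x)]
    return h


def peak_tap(h):
    """Index of the coefficient with the largest magnitude."""
    return max(range(len(h)), key=lambda j: abs(h[j]))


# Synthetic check: the left channel is the right channel delayed by 5 samples
# and attenuated, so the peak should land on tap 5.
rng = random.Random(0)
right = [rng.gauss(0.0, 1.0) for _ in range(2000)]
left_delayed = [0.0] * 5 + [0.8 * v for v in right[:-5]]
h = estimate_impulse_response(right, left_delayed, n_taps=16)
print(peak_tap(h))
```

The index of the peak tap corresponds to the relative delay between the two channels, which is the quantity from which the sound source position is derived.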

The method of determining the important coding area in the picture coding section 15 is not restricted
to the above-described method. For instance, another method such as sensing the face area in the
important coding area 63 may be used. Setting the degree of importance at the picture coding section 15
may be effected by other methods such as setting the degree of importance according to the time for which
the speaker has uttered a vocal sound up to the present time, or setting the degree of importance taking
into account both the time elapsed since the speaker spoke last and the time for which the speaker has
uttered a vocal sound up to the present time.

In a television conference system, since the subjects sit almost still and the television screen is held at
the same view angle with respect to the subjects, the subjects on the screen remain unchanged in position
unless they themselves move. Therefore, by externally setting the degree of importance or the important
coding area at the picture coding section 15, a VIP can always be encoded very precisely. Because the
relationship between the screen and the subject remains unchanged, it is easy to specify the speaker's face
area rather than the speaker's whole picture area. Thus, the configuration may be such that the coded bit
rate is allocated so as to increase the resolution of the specified face area.

While in the above embodiment the coding method at the picture coding section 15 has been explained as
allocating a larger coded bit rate to the important coding area 63 in each frame and thereby coding it
precisely, precise coding may also be effected by bringing the portions other than the important coding
area 63 into a time-lapse state and thereby allocating a larger coded bit rate to the important coding
area 63. The resolution may be changed according to the weighting corresponding to the order in which the
speakers uttered a vocal sound, in such a manner that the highest resolution is given to the latest speaker
and the lowest resolution is given to the earliest speaker in chronological order of speakers.
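
The weighting by order of speaking can be illustrated with a minimal Python sketch that keeps a history of sound source positions, most recent first, and derives importance values satisfying relation (16). The class name, the linear weighting and the use of exact position values as keys are assumptions made for illustration; real position estimates would need matching within a tolerance.

```python
class SpeakerHistory:
    """Remembers the most recent distinct sound source positions."""

    def __init__(self, max_speakers):
        self.max_speakers = max_speakers
        self.positions = []  # most recent speaker first

    def observe(self, position):
        """Record that a sound source was just estimated at this position."""
        if position in self.positions:
            self.positions.remove(position)
        self.positions.insert(0, position)
        del self.positions[self.max_speakers:]  # forget the oldest speakers

    def importance(self, background=1.0, step=1.0):
        """Map each stored position to R(i), with R(1) > ... > R(N) > R(0).

        Assumption: importance falls linearly with the speaking order.
        """
        n = len(self.positions)
        return {pos: background + step * (n - rank)
                for rank, pos in enumerate(self.positions)}


# Hypothetical horizontal positions (in pixels) of successive speakers.
history = SpeakerHistory(max_speakers=3)
for position in [10, 40, 10, 70, 25]:
    history.observe(position)
print(history.importance())
```

The resulting values could serve as the R(i) terms of equation (17), so that the latest speaker receives the largest share of the coded bit rate.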

While in the above embodiment two channels are used for audio input, three or more channels may
be used. In this case, by arranging the microphones at different heights, a two-dimensional
estimation of the sound source can be made. By this approach, a single point on the screen can be
estimated as the sound source, thereby enabling the sound source position to be estimated with much
higher accuracy.

Industrial Applicability 

According to the above-described invention, by estimating the position of the sound source on the basis
of a plurality of channel audio signals and preferentially encoding the vicinity of the sound source
position, it is possible to provide a moving-picture coding system which performs encoding so that the
speaker may appear more clearly.

Claims 

1. A moving-picture coding apparatus comprising: 

image pickup means for picking up at least one subject uttering a vocal sound and outputting a
video signal;

a plurality of sound-sensitive means which are arranged so as to be separate from each other and
which collect a vocal sound from the subject picked up by said image pickup means and output audio signals;

estimating means for estimating the position of the sound source on the basis of the audio signals
outputted from said plurality of sound-sensitive means; and

coding means for encoding the video signal corresponding to a specific range of picture area
centered at the sound source position estimated by said estimating means with a larger coded bit rate
than the video signal corresponding to the other picture areas.

2. A moving-picture coding apparatus according to claim 1, wherein said sound-sensitive means comprises
right and left microphones which are arranged from right to left with respect to a plurality of
subjects and which produce audio signals for right and left channels, and said estimating means
comprises a delay circuit for delaying a left-channel audio signal from said left microphone, an
estimating circuit for estimating a left-channel audio signal on the basis of the delayed left-channel
audio signal from said delay circuit and a right-channel audio signal from said right microphone, a
subtracter circuit for obtaining a difference signal between the delayed left-channel audio signal from
said delay circuit and the estimated left-channel audio signal from said estimating circuit, and a sound
source position estimating circuit which estimates such an estimated left-channel audio signal as allows
said difference signal to become zero when said difference signal is fed back to said estimating circuit,
and which estimates the position of the sound source using an estimated impulse response series
outputted from said estimating circuit.

3. A moving-picture coding apparatus according to claim 2, wherein said estimating circuit comprises an
adaptive transversal filter for calculating an estimated left-channel audio signal for time areas, and a
correction circuit for updating the estimated impulse response series constantly.

4. A moving-picture coding apparatus according to claim 3, wherein said adaptive transversal filter 
comprises an n-tap shift register for shifting a right-channel audio signal consecutively and converting 
the audio signal into the value for each time component, a multiplier for multiplying the estimated 
impulse response for each time component corrected by said correction circuit by each component of 
the right-channel audio signals obtained by way of said shift register, and an adder for finding the sum 
of the multiplication results and producing an estimated left-channel audio input signal. 

5. A moving-picture coding apparatus according to claim 4, wherein said correction circuit contains circuit
means for obtaining an estimated impulse response series, dividing the series by time component, and
supplying the divided series to the corresponding multipliers of said adaptive transversal filter, said
multipliers of said adaptive transversal filter multiply, component by component, an estimated impulse
response series by the right-channel audio signal obtained by way of said shift register, and output an
estimated left-channel audio signal for each time component, and said adder adds the estimated
left-channel audio signals for the individual time components to produce an estimated left-channel audio
signal.

6. A moving-picture coding apparatus comprising: 

image pickup means for picking up at least one subject uttering a vocal sound and outputting a
video signal;

a plurality of sound-sensitive means which are arranged so as to be separate from each other and
which collect sound from the subject picked up by said image pickup means and output audio signals;

estimating means for estimating the position of the sound source on the basis of the audio signals 
outputted from said plurality of sound-sensitive means; 

sound source position storage means for storing the history of information on the present and past 
positions of the sound source estimated by said estimating means; and 

coding means for encoding the video signal with a coded bit rate corresponding to the position on
the basis of the history of the sound source position information and the past sound source position
information stored in said sound source position storage means.

7. A moving-picture coding apparatus according to claim 6, wherein said picture coding means determines
at least one sound source position stored in said sound source position storage means and its
vicinity to be a high picture-quality area, sets each picture-quality level, allocates a coded bit rate so
that the area may have a higher picture quality according to said picture-quality level than the other
areas, and encodes the video signal.

8. A moving-picture coding apparatus according to claim 6, wherein said picture coding means has the 
function of externally setting a high picture-quality area and picture-quality levels and encoding the 
video signal by allocating a coded bit rate so that the area may have a higher picture quality than the 
other areas. 

9. A moving-picture coding apparatus according to claim 6, wherein said sound source position estimating 
means performs a sensing operation on the basis of at least one of the delay difference, phase 
difference, and level difference between the audio signals of said plurality of channels. 

10. A moving-picture coding apparatus according to claim 8, wherein said picture coding means sets 
picture-quality levels according to how often the sound source position appears. 

11. A moving-picture coding apparatus according to claim 6, wherein said sound-sensitive means comprises
right and left microphones which are arranged from right to left with respect to a plurality of
subjects and which produce audio signals for right and left channels, and said estimating means
comprises a delay circuit for delaying a left-channel audio signal from said left microphone, an
estimating circuit for estimating a left-channel audio signal on the basis of the delayed left-channel
audio signal from said delay circuit and a right-channel audio signal from said right microphone, a
subtracter circuit for obtaining a difference signal between the delayed left-channel audio signal from
said delay circuit and the estimated left-channel audio signal from said estimating circuit, and a sound
source position estimating circuit which estimates such an estimated left-channel audio signal as allows
said difference signal to become zero when said difference signal is fed back to said estimating circuit,
and which estimates the position of the sound source using an estimated impulse response series
outputted from said estimating circuit.

12. A moving-picture coding apparatus according to claim 11, wherein said estimating circuit comprises an 
adaptive transversal filter for calculating an estimated left-channel audio signal for time areas, and a 
correction circuit for updating the estimated impulse response series constantly. 

13. A moving-picture coding apparatus according to claim 12, wherein said adaptive transversal filter
comprises an n-tap shift register for shifting a right-channel audio signal consecutively and converting
the audio signal into the value for each time component, a multiplier for multiplying the estimated
impulse response for each time component corrected by said correction circuit by each component of
the right-channel audio signals obtained by way of said shift register, and an adder for finding the sum
of the multiplication results and producing an estimated left-channel audio input signal.

14. A moving-picture coding apparatus according to claim 13, wherein said correction circuit contains
circuit means for obtaining an estimated impulse response series, dividing the series by time
component, and supplying the divided series to the corresponding multipliers of said adaptive
transversal filter, said multipliers of said adaptive transversal filter multiply, component by component, an
estimated impulse response series by the right-channel audio signal obtained by way of said shift
register, and output an estimated left-channel audio signal for each time component, and said adder
adds the estimated left-channel audio signals for the individual time components to produce an
estimated left-channel audio signal.